ABSTRACT



The use of computer techniques to evaluate data in 
an attempt to find useful predictors of various 
criteria is of continuing interest. The use of 
stepwise pattern analysis to select predictors has 
shown promising results. This paper presents a 
refinement of this technique called TPAN, which allows 
the items selected to be "tailored" to the various 
patterns of the previously selected items. This is 
followed by a discussion of the results obtained using 
TPAN to select a four-item pattern, from the responses 
to an advancement examination, that best predicts 
performance on the General Classification Test. 
I. INTRODUCTION



The increasing complexity of modern society has spawned 
a concurrent proliferation of specialized tasks which people 
are required to perform. As the training and skill 
necessary to carry out these tasks have increased, there has 
arisen a desire to select only those most likely to succeed 
to undergo such training and perform such tasks. Thus there 
is considerable interest in selection procedures and methods 
of predicting success. 

In the quest for better and better selectors for more 
and more specialized criteria, the complexity of testing 
procedures has grown. However, because of the costs of 
designing and administering large tests, methods are being 
sought to increase the validity of prediction from ever 
smaller sets of test items. The use of large digital 
computers and improved statistical techniques has aided this 
cause considerably. 

One such technique to improve the predictive validity of 
a set of test items is called pattern analysis. Here, 
rather than aggregating the number of right and wrong 
answers into a single score, the pattern of right and wrong 
answers to the individual questions is analysed. The 
theoretical basis of this method is discussed by Lubin and 
Osburn in Reference 1, and Weitzman presents a summary of 
work on it through 1973 in Reference 2. 

Folce [3] has developed pattern analysis into a 
computerized stepwise technique for selecting a subset of 
best predictors from a larger set of items. This programme 
is called PAIN. The results of this programme compare very 
favourably to the use of aggregate scores. It is the 
intention of this paper to present a refinement of this 
procedure which would allow "tailoring" of the items 
selected so that the best item is selected for each pattern 
of responses to the previously selected items. 
II. TECHNIQUE OF PATTERN ANALYSIS



A. GENERAL



Stepwise pattern analysis is a technique employed to 
select, from a set of binary items, a small subset that is 
the best predictor for some criterion. These binary items 
may reflect correct or incorrect responses to test questions 
or indicate whether or not the subject is included in a 
demographic group, e.g., black or not black, age between 25 
and 30 years or not. The criterion can also be binary, such 
as success or failure in training, or it may be continuous, 
e.g., final examination score. 

Whether the criterion is continuous or binary, the 
process of item selection is the same. An item is selected 
and a pattern score is computed for all possible patterns 
using that item and previously selected items. The pattern 
score is obtained by computing the mean score on the 
criterion for all subjects having that pattern. For 
example, on the first item, the scores of all persons having 
an incorrect (zero) response on that item are averaged to 
give the zero pattern score, and the same is done for all 
persons having a correct (one) response. For the second item 
there are four possible patterns: correct on both items (11), 
incorrect on both items (00), correct on item one and 
incorrect on item two (10), and incorrect on item one and 
correct on item two (01). In all cases, the mean criterion 
score of the subjects in each category is assigned as that 
pattern score. 

After the pattern scores are determined, each subject is 
assigned the pattern score appropriate to his pattern. The 
correlation between the subjects' pattern scores and their 
actual scores is then calculated. This calculation is 
repeated for each item in the set, and the item having the 
highest correlation coefficient is selected as the best item 
to be added to the subset. 
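
The selection step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PAIN or TPAN code: the function names `pearson` and `best_next_item` and the toy data are hypothetical, and responses are held as plain lists rather than packed words.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson r via the sums-of-products computing formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (n * sxy - sx * sy) / denom if denom else 0.0

def best_next_item(responses, scores, chosen):
    """Return (item, r) for the item that, added to `chosen`, gives the
    highest correlation between pattern scores and criterion scores."""
    best = (None, -1.0)
    for item in range(len(responses[0])):
        if item in chosen:
            continue
        items = chosen + [item]
        # Group criterion scores by response pattern on the candidate subset.
        groups = {}
        for resp, score in zip(responses, scores):
            groups.setdefault(tuple(resp[i] for i in items), []).append(score)
        means = {p: sum(v) / len(v) for p, v in groups.items()}
        # Each subject is assigned the mean criterion score of the pattern.
        ps = [means[tuple(resp[i] for i in items)] for resp in responses]
        r = pearson(ps, scores)
        if r > best[1]:
            best = (item, r)
    return best

# Hypothetical toy data: 6 subjects, 3 binary items.
responses = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
scores = [60.0, 58.0, 40.0, 42.0, 61.0, 39.0]
first, r1 = best_next_item(responses, scores, [])
```

In the toy data the first item separates the high scorers from the low scorers almost perfectly, so it is the one selected.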

Using this method a great deal of information can be 
obtained from relatively meager data. For instance, Folce 
was able to select only seven of the 70 items in the 
Electronics Technician Selection Test and obtain a 
correlation of better than 0.8 between the pattern score and 
the final grade assignment at the Electronics Technician 
School at San Diego, California. However, it should be 
possible to get even more information from the same sized 
subset by allowing different items to be selected for 
different subsets of the sample. That is, having selected 
the first item, the sample can be divided into two groups, 
those scoring a one on that item and those scoring a zero. 
It is quite possible that the next best predictor may be 
different for each of these groups, and different from the 
best predictor for the group as a whole. While PAIN selects 
the next item based on the whole group, tailored pattern 
analysis would allow a different item to be selected for 
each subgroup. A computer programme called TPAN has been 
developed to select such a tailored pattern of four items. 



B. TPAN, A TAILORED PATTERN SELECTOR

TPAN is an ALGOL programme which will select a four-item 
pattern with the highest correlation between the pattern 
score for each individual and his actual score. 

The programme first reads one card which must contain 
the number of binary test items for each subject in the 
sample (NITMS) and the number of subjects in the sample 
(NIS). The number of items is then passed to a FORTRAN 
subroutine called INPTTR to read in the data. This 
subroutine reads the complete data for one subject and 
passes back, to the main programme, the criterion score and 
an integer array of ones and zeros which are the item 
responses for that subject. Also, if it has reached the end 
of the file, the subroutine returns the actual number of 
records it has read so that the number of subjects (NIS) can 
be updated. To reduce the amount of memory required by the 
programme, TPAN compresses the data in the response array so 
that the responses to 32 items are contained in one word. 
Two new arrays are then formed, each having one entry for 
each subject in the sample. One array contains the 
criterion scores and the other the item responses. Each 
entry in the latter uses as many words as are required to 
contain the responses to all the items. 
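
The compression of 32 responses into one word can be illustrated as follows. This is a minimal Python sketch, not the ALGOL code: the names `pack_responses` and `get_response` are hypothetical, and the choice of which bit within a word holds which item is an assumption, since the text does not state the bit order.

```python
WORD_BITS = 32  # items per word, as stated in the text

def pack_responses(resp):
    """Pack a list of 0/1 responses into a list of 32-bit integers."""
    nsegs = (len(resp) + WORD_BITS - 1) // WORD_BITS  # words needed
    words = [0] * nsegs
    for item, bit in enumerate(resp):
        if bit:
            # Assumed layout: item k sits in bit (k mod 32) of word (k div 32).
            words[item // WORD_BITS] |= 1 << (item % WORD_BITS)
    return words

def get_response(words, item):
    """Recover the response to one item from the packed words."""
    return (words[item // WORD_BITS] >> (item % WORD_BITS)) & 1

# Hypothetical subject with 151 binary items.
resp = [1, 0, 1] + [0] * 147 + [1]
packed = pack_responses(resp)
```

With 151 items, five words per subject are enough, against 151 words if each response occupied an integer of its own.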

Most of the work is done by the subroutine BITPICKER. 
This routine, having been passed the array of scores and 
responses, selects the item from the responses that has the 
highest correlation between pattern scores and actual 
scores. 

The subject's response to a particular item is 
determined by placing a one in a mask only in the bit 
corresponding to the item under consideration. A logical 
"and" operation is then performed with the word containing 
the subject's response. Only if his response to that one 
item was a one will the result of the "and" operation be 
other than zero. In such a case his score will be added to 
the sum for the "one" responses and the number of "one" 
responses will be incremented. If a zero results from the 
"and", the changes will be made to the "zero" response data. 
A running total is also kept of the sum of the squares of 
the scores of each individual. 

When the responses of all the subjects have been checked 
the mean score for the zero and one responses is calculated, 
giving the pattern scores. These, along with the sum of the 
criterion scores and the sum of the squares of the criterion 
scores, are enough data to calculate the correlation 
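coefficient. The computing formula for the Pearson 
product-moment correlation coefficient is used: 

$$R = \frac{N \sum (cs)(ps) - \left(\sum cs\right)\left(\sum ps\right)}{\left\{\left[N \sum ps^{2} - \left(\sum ps\right)^{2}\right]\left[N \sum cs^{2} - \left(\sum cs\right)^{2}\right]\right\}^{1/2}}$$

where cs is a subject's criterion score, ps is his pattern 
score, and N is the number of subjects. 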

This procedure of obtaining pattern scores and then 
calculating the correlation coefficient is repeated for each 
item in the set. The item having the highest correlation 
coefficient is selected. 
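
Because the pattern scores are group means, every sum in the computing formula can be obtained from the counts, the group sums, and the sum of squared criterion scores. The following Python sketch shows this for a single item; the function name `r_from_tallies` and the numbers are hypothetical, and it is not claimed to mirror the ALGOL code line for line.

```python
from math import sqrt

def r_from_tallies(n0, sum0, n1, sum1, sum_sq):
    """Pearson r between pattern scores and criterion scores for one item,
    using only group counts, group sums and the sum of squared scores."""
    n = n0 + n1
    if n0 == 0 or n1 == 0:
        return 0.0                      # the item does not split the group
    m0, m1 = sum0 / n0, sum1 / n1       # the two pattern scores
    s_cs = sum0 + sum1                  # sum of criterion scores
    s_ps = n0 * m0 + n1 * m1            # sum of pattern scores (equals s_cs)
    s_ps2 = n0 * m0 ** 2 + n1 * m1 ** 2 # sum of squared pattern scores
    s_csps = m0 * sum0 + m1 * sum1      # sum of cs * ps
    denom = sqrt((n * s_ps2 - s_ps ** 2) * (n * sum_sq - s_cs ** 2))
    return (n * s_csps - s_cs * s_ps) / denom if denom else 0.0

# Hypothetical group: zeros scored 35 and 45, ones scored 55 and 65.
r = r_from_tallies(n0=2, sum0=80.0, n1=2, sum1=120.0, sum_sq=10500.0)
```

For this toy group the result is the square root of 0.8, about 0.894.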

To facilitate the use of this routine for iterations 
when it is desired only to use a subset of subjects who had 
a particular pattern, a pointing vector is used rather than 
directly using the arrays of scores and responses. That is, 
the subroutine BITPICKER is always passed the total array of 
responses and scores. It is also passed another array 
containing the positions in the main array of all subjects 
who are to be used in the calculation. This is the 
so-called pointing vector. 

For example, to select the first item the pointing 
vector contains all the integers up to and including the 
total number of subjects in the sample. Thus when the 
subroutine checks each subject whose number is in the 
pointing vector, it checks the whole set. However, as the 
data for each subject is checked, his position number is put 
into one of two vectors depending on whether the response to 
that item is a one or a zero. These two vectors, for the 
item with the highest correlation coefficient, are passed 
back to the main programme. 

When the second item is to be picked, BITPICKER is first 
passed the pointing vector to those subjects having a zero 
response to the first item. The subroutine will then pick 
the item having the highest correlation only for those 
subjects having the zero response. The subroutine is then 
called again but with the pointing vector for the one 
responses. Thus, a different second item may be picked for 
this subgroup. In each case two new pointing vectors are 
passed back to the main programme, pointing to those 
subjects having a zero and those having a one response to 
the chosen second items. 
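
The use of pointing vectors can be sketched as below. This is an illustrative Python stand-in for BITPICKER, not a transcription of it: the name `bitpick`, the `used` set, and the data are hypothetical, and the correlation is computed through the identity r^2 = SS(between) / SS(total), which is algebraically equal to the Pearson formula when the pattern scores are group means.

```python
from math import sqrt

def bitpick(resp, score, ptr, used):
    """Among items not in `used`, find the item whose 0/1 split of the
    subjects listed in `ptr` correlates best with their scores.
    Returns (r, item, ptr0, ptr1)."""
    n = len(ptr)
    s = sum(score[p] for p in ptr)
    ss = sum(score[p] ** 2 for p in ptr)
    best = (0.0, None, [], [])
    for item in range(len(resp[0])):
        if item in used:
            continue
        ptr0 = [p for p in ptr if resp[p][item] == 0]
        ptr1 = [p for p in ptr if resp[p][item] == 1]
        if not ptr0 or not ptr1 or n * ss == s * s:
            continue  # no split, or no variance in the scores
        s0 = sum(score[p] for p in ptr0)
        s1 = s - s0
        between = s0 * s0 / len(ptr0) + s1 * s1 / len(ptr1) - s * s / n
        r = sqrt(max(between, 0.0) / (ss - s * s / n))
        if r > best[0]:
            best = (r, item, ptr0, ptr1)
    return best

# Hypothetical data: 8 subjects, 3 binary items.
resp = [[0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1],
        [1, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 0]]
score = [50.0, 52.0, 40.0, 41.0, 30.0, 31.0, 20.0, 22.0]
ptr = list(range(len(score)))            # first call: the whole sample
r, itm, ptr0, ptr1 = bitpick(resp, score, ptr, used=set())
r0, itm0, ptr00, ptr01 = bitpick(resp, score, ptr0, used={itm})
r1, itm1, ptr10, ptr11 = bitpick(resp, score, ptr1, used={itm})
```

In this toy sample the two subgroups pick different second items, which is exactly the situation tailoring is meant to exploit.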

The main body of the programme is, therefore, a series 
of calls to the subroutine BITPICKER, passing it the arrays 
of scores (SCORE) and responses (RESP), the appropriate 
pointing vector (PTR for the first item), and the number of 
entries in that vector (NIS). The subroutine returns two 
new pointing vectors (PTR0 and PTR1 for the first item), as 
well as the correlation coefficient (R), item number (ITM), 
totals of ones (TOT1) and zeros (TOT0), and the pattern 
scores for ones (MPS1) and zeros (MPS0). There are also 
masks passed back and forth to indicate which items have 
already been chosen (MASKIR and MASK) and standard 
accounting data of the total number of items (NITMS) and the 
number of words required to hold all the items at 32 items 
per word (NSEGS). 

On the second call, the best item for those subjects 
having the zero response to item one is desired. Therefore, 
the data passed are the pointing vector PTR0 and its length 
TOT0. The data returned are: correlation coefficient R0, 
item number ITM0, and the pointing vectors, totals and 
pattern scores, PTR01, TOT01, MPS01, PTR00, TOT00, and 
MPS00. 

The final result of TPAN is a set of 16 patterns 
described by the binary numbers 0000 through 1111. Each 
binary digit represents the response to one of the four 
items selected. The first item will be the same for all 
patterns. There may be two different second items (one for 
each response to item one), four third items and eight 
fourth items. The final step in the programme is to 
calculate an overall correlation coefficient. Each subject 
is assigned the pattern score appropriate to his responses 
on the selected items. The correlation coefficient is then 
calculated using the same algorithm as for the individual 
items. 
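
The final step can be sketched as a walk down the tree of tailored items. In this Python illustration the tree is a hypothetical dictionary that maps the pattern so far to the item asked next; the names `pattern_of` and `overall_correlation`, the two-level tree, and the data are assumptions made for the example.

```python
from math import sqrt

def pattern_of(resp_row, tree):
    """Follow the tailored selection tree to get a subject's pattern string."""
    pattern = ""
    while pattern in tree:
        pattern += str(resp_row[tree[pattern]])
    return pattern

def overall_correlation(resp, score, tree):
    """Return the pattern scores and the overall Pearson r."""
    patterns = [pattern_of(row, tree) for row in resp]
    groups = {}
    for p, s in zip(patterns, score):
        groups.setdefault(p, []).append(s)
    means = {p: sum(v) / len(v) for p, v in groups.items()}
    ps = [means[p] for p in patterns]   # pattern score of each subject
    n = len(score)
    s_cs, s_ps = sum(score), sum(ps)
    s_cs2 = sum(c * c for c in score)
    s_ps2 = sum(p * p for p in ps)
    s_cp = sum(c * p for c, p in zip(score, ps))
    denom = sqrt((n * s_ps2 - s_ps ** 2) * (n * s_cs2 - s_cs ** 2))
    r = (n * s_cp - s_cs * s_ps) / denom if denom else 0.0
    return means, r

# Hypothetical two-item tailored pattern: item 0 first, then item 1 for
# subjects who answered 0 and item 2 for subjects who answered 1.
tree = {"": 0, "0": 1, "1": 2}
resp = [[0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1],
        [1, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 0]]
score = [50.0, 52.0, 40.0, 41.0, 30.0, 31.0, 20.0, 22.0]
means, r = overall_correlation(resp, score, tree)
```

A four-item pattern would simply use a deeper tree, with keys of up to three digits.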

An additional facility provided by TPAN is the ability 
to set bounds on the criterion scores which it is desired to 
use. This is done by including two more numbers on the 
single ALGOL input data card; these are the upper limit of 
the desired scores and the lower limit. As the data records 
are read, each score is checked against these bounds and if 
it is outside the limits that record is rejected. The number 
of the record is printed out, as well as the score on which 
it was rejected. After all the records have been read, the 
number in the sample is revised to allow for the rejected 
records. 
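
A minimal Python sketch of this bounds check follows. The function name `filter_by_bounds` and the records are hypothetical, and treating both limits as inclusive is an assumption; the text does not say how a score exactly on a limit is handled.

```python
def filter_by_bounds(records, lower, upper):
    """Keep records whose criterion score lies within [lower, upper];
    report the record number and score of each rejected record."""
    kept, rejected = [], []
    for number, (score, responses) in enumerate(records, start=1):
        if lower <= score <= upper:
            kept.append((score, responses))
        else:
            rejected.append((number, score))
    return kept, rejected

# Hypothetical records: (criterion score, item responses).
records = [(45.0, [1, 0]), (0.0, [0, 0]), (99.0, [1, 1]), (120.0, [0, 1])]
kept, rejected = filter_by_bounds(records, lower=1, upper=99)
sample_size = len(kept)   # revised number of subjects in the sample
```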

A complete listing of the ALGOL programme is contained 
in Appendix A. 
III. AN APPLICATION OF TPAN



In order to test the programme, TPAN was run using as 
data the results of an advancement examination to pay grade 
7 for boiler technicians. The source data contained the 
results for approximately 1100 enlisted men for the 150 
items on this examination. From these responses plus an 
additional item indicating whether the race of the 
individual was black, TPAN was to select the four best items 
to predict the subject's score on the General Classification 
Test (GCT). 

A valid range of 1 to 99 was set for the GCT scores and 
a number of records were outside this range (the field 
contained either a zero or non-numeric data). TPAN 
eliminated these records and the final sample contained 1024 
subjects. The results obtained from this run are given in 
Table 1. 

The values of the correlation coefficients given in the 
table are those used in selection of the items and, hence, 
represent the correlation only within the subset of subjects 
having the pattern shown for the previously selected items. 
It will be noted that these correlations are all rather 
small, ranging from 0.16 to 0.50. This is to be expected, 
however, as the advancement examination is not intended to 
measure the same qualities as the GCT. This is further 
borne out by the fact that the first item chosen, that is, 
the single best indicator of performance on the GCT among 
the items considered, was item 1, race. 

TABLE 1

SELECTION OF ITEMS FROM ADVANCEMENT EXAMINATION AS
PREDICTORS FOR GCT

             correl-                number    mean   number    mean
pattern      ation   items          of 0's   score   of 1's   score

0,1          .247    1                 896   47.60      128   41.73
00,01        .242    1,24              399   45.56      497   49.23
000,001      .200    1,24,7            192   44.08      207   46.93
0000,0001    .217    1,24,7,104        110   42.75       82   45.87
0010,0011    .256    1,24,7,23         124   45.48       83   49.10
010,011      .218    1,24,104          217   47.39      280   50.66
0100,0101    .164    1,24,104,93       153   46.65       64   49.16
0110,0111    .222    1,24,104,86       124   48.80      156   52.15
10,11        .248    1,7                53   38.45       75   44.05
100,101      .562    1,7,104            32   36.09       21   42.05
1000,1001    .453    1,7,104,23         27   34.67        5   43.80
1010,1011    .495    1,7,104,13          4   50.00       17   40.17
110,111      .341    1,7,120            27   40.89       48   45.83
1100,1101    .450    1,7,120,101        14   43.71       13   37.85
1110,1111    .321    1,7,120,55         16   48.81       32   44.34

Even given these less than ideal circumstances, the 
overall correlation for the four-item patterns was 0.47. 
This compares favourably with the figure of 0.40 obtained 
for four items selected by PAIN. Moreover, one of the 
disadvantages of PAIN is the amount of time and computer 
memory required to run it. For the 1100 subject sample PAIN 
required 400,000 bytes of memory and 4 minutes to run. TPAN, 
on the other hand, required only 180,000 bytes and ran in 
slightly over 3 minutes. This is partly because 
only two patterns are assessed on each iteration and 
partly because of the more efficient handling of the 
algorithm allowed by ALGOL. 
IV. CONCLUSIONS AND RECOMMENDATIONS



The results of the study indicate that tailoring the 
item selection in pattern analysis enables more information 
to be extracted from a four-item pattern than if straight 
stepwise selection based on the whole group is used. 
However, the actual advantage gained in terms of the amount 
of prediction per test item is questionable. There are, in 
fact, eight sets of four test items using a total of up to 
15 different items. Therefore, if TPAN were to be used to 
select items to be included in a minimum length test, its 
performance would have to be compared to PAIN selecting a 15 
item subset. On the other hand, if it is desired to look at 
existing data in an attempt to predict some output, TPAN 
should present a distinct advantage. 

There are several areas where TPAN could be improved and 
extended. The first is the data printed out. As mentioned, 
the correlation coefficients that are given are those within 
the subset used to pick the next item. More useful values 
would be the overall correlations at the end of the 
selection of all second, third, and fourth items. The final 
one is the only one calculated at present. To accomplish 
this would require only the accumulation of one or two 
more items of data, which are already available, and two 
additional correlation calculations. 

Another shortcoming of the programme is its response 
when it reaches a point of indifference to all items, i.e., 
the correlation coefficients for all items are zero. At 
present, in this situation, the programme prints an 
obviously erroneous item number (-32), and sets all of the 
statistics (mean scores and totals of zero and one 
responses) to zero. This action will disrupt the 
calculation of the overall correlation coefficient. The most 
reasonable corrective action in this case would be to 
terminate selection of items and, when calculating the 
overall correlation coefficient, use the pattern and pattern 
scores derived for the last good item. 

Increasing the number of items in the pattern presents 
no programming problems. It is simply necessary to add more 
calls to the subroutine BITPICKER, passing it the 
appropriate pointing vector. The problems encountered are 
statistical. The numbers of patterns and possible different 
items double with each addition of one item to the pattern. 
With 1100 subjects in the sample, there are already some 
subgroups of fewer than 20 subjects. The validity of items 
selected on the basis of such small samples is questionable. 
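
The growth referred to above can be written out explicitly. As a worked check, assuming a tailored pattern of k items in which every branch happens to select a different item (the worst case):

```latex
\begin{align*}
\text{patterns after } k \text{ items} &= 2^{k}, \\
\text{maximum distinct items} &= \sum_{j=0}^{k-1} 2^{j} = 2^{k} - 1, \\
k = 4:&\quad 2^{4} = 16 \text{ patterns}, \quad 2^{4} - 1 = 15 \text{ items}, \\
k = 5:&\quad 2^{5} = 32 \text{ patterns}, \quad 2^{5} - 1 = 31 \text{ items}.
\end{align*}
```

With the 1024 subjects of the final sample, 32 patterns would average only 32 subjects each even if every split were even, and the splits reported in Table 1 are far from even.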

A final and very interesting area for increasing the 
scope of the programme would be to include some ability to 
manipulate continuous data as well as binary items. The 
programme could be changed to determine the correlation 
between any pair of continuous attributes of the subgroups 
having pattern responses selected by TPAN. All that would 
be required would be to read in an array or arrays of the 
values of the continuously variable data for each subject. 
Then, after each item was selected, the pointing vector 
produced by the BITPICKER subroutine could be used to select 
the appropriate subjects' data from the arrays of continuous 
variables. Each correlation coefficient thus derived would 
be for a subgroup having a particular pattern. Such a 
routine could be used to determine for which of several 
subgroups, having different patterns of responses, the 
correlation was highest. Such a programme could also answer 
other interesting questions. For example, if we select a 
subgroup having a pattern with high correlation between 
pattern and actual scores, how does it affect the 
correlation between an independent continuous variable and 
the same criterion score? 
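
The proposed extension could be sketched as follows. This is a hypothetical Python illustration only, since no such routine exists in TPAN: the name `subgroup_correlation` and the data are invented, and the pointing vector is taken to be a list of subject positions of the kind BITPICKER returns.

```python
from math import sqrt

def subgroup_correlation(x, y, ptr):
    """Pearson r between two continuous variables, restricted to the
    subjects whose positions are listed in the pointing vector `ptr`."""
    n = len(ptr)
    sx = sum(x[p] for p in ptr)
    sy = sum(y[p] for p in ptr)
    sxx = sum(x[p] ** 2 for p in ptr)
    syy = sum(y[p] ** 2 for p in ptr)
    sxy = sum(x[p] * y[p] for p in ptr)
    denom = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return (n * sxy - sx * sy) / denom if denom else 0.0

# Hypothetical continuous data for six subjects.
x = [1.0, 2.0, 3.0, 4.0, 10.0, 9.0]
y = [2.0, 4.0, 6.0, 8.0, 1.0, 5.0]
ptr_a = [0, 1, 2, 3]      # subjects sharing one response pattern
ptr_b = [4, 5]            # subjects sharing another
r_a = subgroup_correlation(x, y, ptr_a)
r_b = subgroup_correlation(x, y, ptr_b)
```

Comparing the values across subgroups would show for which response patterns the continuous variable is most closely related to the criterion.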
APPENDIX A



LISTING OF ALGOL PROGRAMME TPAN



The following pages contain a listing of the source file 
of the ALGOL programme. The first two columns would not be 
part of an input deck, but are included to facilitate 
reading the programme. A number in the first column 
indicates where a block of code starts. The same number in 
the next column indicates the end of that block. 
APPENDIX B



LISTING OF FORTRAN INPUT ROUTINE



The following listing is the FORTRAN input routine used 
with data supplied on the advancement examination. The file 
was 160 characters long, with race in column 6 followed by 
the responses on the 150 questions. The GCT score was in 
columns 157 and 158. The data was read into an integer array 
and passed back to the main programme. 

      SUBROUTINE INPTTR (I, SCORE, NITMS, NRC)
C     READS ONE SUBJECT RECORD FROM UNIT 8 ON EACH CALL.
C     I     = ITEM RESPONSES (ONES AND ZEROS)
C     SCORE = CRITERION (GCT) SCORE
C     NRC   = NUMBER OF RECORDS READ, SET ONLY AT END OF FILE
      DIMENSION I(NITMS)
      DATA J/0/
      READ (8, 100, END=50) I, SCORE
  100 FORMAT (5X, 151I1, F2.0, 1X)
      J = J + 1
      RETURN
   50 NRC = J
      RETURN
      END